
Fastcdc #11

Closed · wants to merge 5 commits
Conversation

@dpc (Collaborator) commented Jul 30, 2017

No description provided.

@dpc (Collaborator, author) commented Jul 30, 2017

This is on top of #10.

Still a WIP; I need to confirm that it actually works as expected, etc.

@dpc (Collaborator, author) commented Jul 30, 2017

@RAOF You've mentioned that you're planning to work on FastCDC. Currently I'm not even sure whether I understood the whitepaper correctly, but here it is, in case you're interested.

@dpc force-pushed the fastcdc branch 3 times, most recently from 51a067c to 4b9f37e on August 1, 2017
@aidanhs (Owner) commented Aug 2, 2017

@dpc (Collaborator, author) commented Aug 2, 2017

Oh, nice! There was so much stuff that I was unclear about, and I think this one might actually be right.

It doesn't look to me like it is going to be fast, though. The API enforces copying, and the nested conditions inside the loop (https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L103) are probably not going to get optimized away. As far as I understand, the biggest chunk of the speedup was being able to skip rolling through the initial bytes; Normalized Chunking was supposed to counteract the resulting deduplication-ratio loss without affecting the speedup.
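A minimal sketch of that structure as I read it (all constants, masks, and the table handling are placeholders, not the code in this PR or in zvault): skip the first min_size bytes outright, then use a stricter mask before the average size and a looser one after it.

use std::cmp;

// Placeholder constants; the paper's actual parameters and masks differ.
const MIN_SIZE: usize = 2 * 1024;
const AVG_SIZE: usize = 8 * 1024;
const MAX_SIZE: usize = 64 * 1024;
const MASK_STRICT: u64 = 0x0000_d930_0353_0000; // more one-bits: cut points are rarer
const MASK_LOOSE: u64 = 0x0000_d900_0003_0000;  // fewer one-bits: cut points are more likely

// Sketch: skip the first MIN_SIZE bytes entirely, then roll a Gear-style digest,
// switching from the strict mask to the loose one once AVG_SIZE is passed.
fn find_cut_point(buf: &[u8], table: &[u64; 256]) -> usize {
    let mut digest = 0u64;
    let end = cmp::min(buf.len(), MAX_SIZE);
    let mut i = cmp::min(MIN_SIZE, end); // the skipped bytes are the main speedup
    while i < end {
        digest = (digest << 1).wrapping_add(table[buf[i] as usize]);
        let mask = if i < AVG_SIZE { MASK_STRICT } else { MASK_LOOSE };
        if digest & mask == 0 {
            return i + 1;
        }
        i += 1;
    }
    end
}

fn main() {
    // Throwaway table just so the sketch runs; a real Gear table is 256 random u64s.
    let mut table = [0u64; 256];
    let mut s = 1u64;
    for e in table.iter_mut() {
        s = s.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        *e = s;
    }
    let data = vec![0xABu8; 32 * 1024];
    println!("cut at {}", find_cut_point(&data, &table));
}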

At the very minimum, I should write a test that compares the two implementations, and then beat the code into submission until both versions yield the same results. :)

@dswd I need more time to review, but in the meantime: https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L24 - what is the deal with these masks? In the Ddelta paper, Gear was described as just using the most significant bits, so why can't that be used in FastCDC?

What do you think about statically including the lookup table? It's just 2K.

@dswd commented Aug 2, 2017

Hi, I reviewed my code and implemented some improvements:

  • I added some more tests to verify that changes to my code still result in the same chunk sizes.
  • I moved most conditions out of the loop, similar to the code in this PR.

On the masks: the paper describes the masks for an 8k average chunk size and mentions that spreading the 1 bits out in the masks yields slightly better deduplication. So that is what I am doing in that method: I am using a simple PRNG, seeded with the given seed, to set the correct number of random bits in the masks. The major thing here is that the masks depend on the seed; otherwise you could just use a static one.
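A hedged sketch of that seeding scheme (not dswd's actual code; splitmix64 and the bit count are arbitrary choices here):

// splitmix64: an arbitrary small PRNG picked for this sketch; any deterministic generator works.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Set `one_bits` bits at pseudo-random positions of a 64-bit mask, derived
// deterministically from `seed`, so the mask changes with the chunker's seed.
fn seeded_mask(one_bits: u32, seed: u64) -> u64 {
    let mut state = seed;
    let mut mask = 0u64;
    while mask.count_ones() < one_bits {
        mask |= 1u64 << (splitmix64(&mut state) % 64);
    }
    mask
}

fn main() {
    // 13 one-bits roughly corresponds to an 8k average; the exact per-stage
    // counts are an assumption here.
    println!("{:064b}", seeded_mask(13, 42));
}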

On the static lookup table: I can't do this, as my table is seeded, but I don't see a problem with a static table if you do not need this feature. However, the effort of computing the table is negligible, and I doubt a static one would be any faster to access during chunking.

On the performance: on my laptop, my optimized code reaches over 1000 MB/s, while the code in this PR stays below 500 MB/s (my previous, unoptimized code had the same speed). I don't know exactly why, but I suspect that my dummy sink allows the compiler to omit the costly memcpy calls. Surely this is not the most elegant way to get the positions, but it seems fast enough. I have no clue why your code is so much slower; it should be at least as fast as mine.
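For illustration, a minimal sketch of a dummy sink in that sense (an assumption about the shape, not the zvault code): a Write implementation that counts bytes but never copies them, which the optimizer can often reduce to almost nothing.

use std::io::{self, Write};

// A Write that records how many bytes it was offered but never copies them.
struct CountingSink {
    written: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.written += buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut sink = CountingSink { written: 0 };
    sink.write_all(b"some chunk data").unwrap();
    println!("{} bytes \"written\"", sink.written);
}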

@dpc (Collaborator, author) commented Aug 2, 2017

Very interesting! We should definitely investigate how far we can push our two versions. :)

I glanced quickly at https://github.com/dswd/zvault/blob/master/chunking/benches/all.rs#L107, and I am suspicious that DevNull allows the compiler to optimize everything away in a way it couldn't in real code.

Similarly, there is such a risk in my benchmarks, but since it is not that much faster, I haven't thought about it.

At the bare minimum, we should pass the results through https://doc.rust-lang.org/1.1.0/test/fn.black_box.html to prevent (somewhat, hopefully) unfair optimizations.
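A minimal sketch of what that could look like with the nightly test crate (e.g. in a benches/ file); chunk_positions is a placeholder for whichever chunker is actually being measured:

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Placeholder "chunker": not a real CDC algorithm, just something measurable.
fn chunk_positions(buf: &[u8]) -> Vec<usize> {
    buf.iter()
        .enumerate()
        .filter(|&(_, &b)| (b & 0x3F) == 0)
        .map(|(i, _)| i)
        .collect()
}

#[bench]
fn perf_1mb(b: &mut Bencher) {
    let data = vec![0x5Au8; 1024 * 1024];
    b.bytes = data.len() as u64;
    b.iter(|| {
        // black_box on both the input and the result keeps the compiler from
        // proving the work unused and deleting it.
        black_box(chunk_positions(black_box(&data)))
    });
}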

@RAOF left a review

Some nits, and a suggestion for later refactoring.

if self.current_chunk_size < avg_size {
    let roll_bytes = cmp::min(avg_size - self.current_chunk_size,
                              buf.len() as u64);
    let result = self.gear.find_chunk_edge_cond(buf, |e: &Gear| (e.digest() >> min_shift) == 0);
@RAOF:

This is missing the padding-zeros optimisation for deduplication efficiency (likewise the large-chunk calculation)

@dpc (Collaborator, author):

I don't get the padding zeros. Since the underlying Gear (as described in the Ddelta paper) is taking the most significant bits, isn't that the ultimate padding? I was so confused by this.

@RAOF:

Hm, you're quite right. I've misread this; the window here is essentially the width of the digest: 64 bytes.

The authors of the paper describe the mask they use in the algorithm as empirically derived, but infuriatingly give no details about it. You'd think that taking the largest window would be best, but apparently not? Apparently it works best when the contributing bits are spread approximately uniformly across the 64-bit digest?
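To make the two options concrete, a tiny sketch with illustrative constants (not the paper's actual masks): both tests look at 13 digest bits, they just pick different positions.

fn cut_top_bits(digest: u64) -> bool {
    // Gear-style test as in this PR: the 13 most significant bits must be zero.
    (digest >> (64 - 13)) == 0
}

fn cut_spread_bits(digest: u64) -> bool {
    // FastCDC-style test: 13 one-bits spread across the word, all of which
    // must be zero in the digest.
    const MASK: u64 = 0x0000_d930_0353_0000; // 13 bits set
    digest & MASK == 0
}

fn main() {
    let d = 0xdead_beef_dead_beefu64;
    println!("{} {}", cut_top_bits(d), cut_spread_bits(d));
}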

@dpc (Collaborator, author) commented Aug 3, 2017:

> it works best when the contributing bits are spread approximately uniformly across the 64-bit digest?

But why? :D I am not a very academically minded person, but I found this paper rather confusing in many places: a lot of repeating the obvious, and glossing over the important details.

Well, @dswd has the correct implementation here (https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs), so we can just use it. :)


let mut cur_offset = 0usize;

loop {
@RAOF:

What is this loop for? As far as I can tell it will loop exactly once, because all codepaths return?



impl Engine for FastCDC {
    type Digest = u64;
@RAOF:

This really emphasises that trait Engine is an incorrect abstraction for CDC. FastCDC doesn't actually have a Digest, nor can you roll a single byte.

(Similarly, AE and MAXP can't even pretend to have a digest, because they're not even approximately hash-based)
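For illustration only, a hypothetical digest-free interface (not an API that exists in this crate) could be as small as:

// Hypothetical sketch: a chunker interface that exposes nothing but chunk
// edges, so digest-less algorithms like AE or MAXP could also implement it.
trait ChunkEdgeFinder {
    /// Return the offset just past the next chunk edge found in `buf`, if any,
    /// carrying whatever internal state is needed across calls.
    fn next_edge(&mut self, buf: &[u8]) -> Option<usize>;
}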

@dpc (Collaborator, author):

True. However, Engine is all we have ATM. :)

@aidanhs (Owner):

Interesting point. I had only been looking at rolling checksums (hence the crate name) and hadn't really thought about non-checksum-based alternatives for chunking.

@RAOF commented Aug 3, 2017

> But why? :D I am not a very academically minded person, but I found this paper rather confusing in many places: a lot of repeating the obvious, and glossing over the important details.

That's an excellent question! This appears to be a recurring problem in CDC papers I've read.

It would have been really nice to see their derivation of their masks - what did they try? What was the range of deduplication efficiencies? How stable is their choice - is there a dataset upon which it is really good, and others on which it's mediocre, or is it reliably good for all datasets they tried? 🤷‍♀️

@dswd commented Aug 3, 2017

My new code is using black_box, which does not seem to change the performance. I also changed my dummy sink to calculate the split positions, to be comparable to your code.

Speaking of the curious masks, I am also very unsure why they want to distribute the bits uniformly. It seems they do not really want to have a 64-byte window.


// ignore bytes that are not going to influence the digest
if self.current_chunk_size < ignore_size {
    let skip_bytes = cmp::min(ignore_size - self.current_chunk_size, buf.len() as u64);
@dpc (Collaborator, author) commented Aug 3, 2017:

Doing this min outside of the condition might perform better, like here: https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L99
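A tiny standalone sketch of that suggestion (hypothetical free function, not the code under review): saturating_sub makes the clamp unconditional, so it already yields 0 once current_chunk_size has passed ignore_size.

use std::cmp;

// Clamp without branching on the comparison first.
fn bytes_to_skip(current_chunk_size: u64, ignore_size: u64, buf_len: u64) -> u64 {
    cmp::min(ignore_size.saturating_sub(current_chunk_size), buf_len)
}

fn main() {
    assert_eq!(bytes_to_skip(0, 2048, 4096), 2048);
    assert_eq!(bytes_to_skip(3000, 2048, 4096), 0); // already past ignore_size
}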

@dpc (Collaborator, author) commented Aug 3, 2017

@dswd I will investigate your code and benchmarks more at some point, but as it is right now I actually like your code somewhat better anyway. :D

The one thing that I don't like is these Read and Write arguments. I am still scratching my head over how your code can be faster with that copying there. Probably it gets totally optimized away, but in real code it might not. In rdedup I will already have the data in a Vec in SGData or something like that, and I want to use the chunks in place, without copying them anywhere.

Also, in my implementation I do roll through the 64 bytes before min_size, as they do affect the checksum for the first bytes after min_size. I doubt it matters in practice, so maybe it could be ignored for performance. Dunno.

In my version I use 8 for both min_size and max_size, which leaves your code with a bigger min_size, so there's more to skip... Probably not going to affect the performance that much.

OK. Will get back to it at some point. :)

@dswd commented Aug 3, 2017

The main advantage of the Read and Write types is that the copy gets optimized away if the Write type ignores the actual data (at least it seems so), and if you really do need to copy the data somewhere, you can copy it directly to where it should go (by passing the destination as the Write) instead of into some intermediate buffer.

I thought about including the data before min_size too, but in the end those bytes do not matter. With the chunk sizes we are talking about (8k+), the chances of missing a cut point in the first bytes after min_size because of the 64 bytes before it are negligible.

In my experiments, skipping the min_size bytes did not give that much of a boost. Theoretically it should give up to 25%, as I am skipping that much of the input data, but for me it only gives about 15%. Maybe this is because those bytes would only be hashed, without the hash being compared against any mask. So if you are skipping 12.5% instead, your maximal speed benefit can be about 12.5%, but I would guess that it is below 10%.

@aidanhs (Owner) commented Aug 3, 2017

Just a minor note - please don't copy code directly from @dswd as our licenses are not compatible (assuming the chunking code has the same license as zvault itself) :)

@dswd commented Aug 3, 2017

The chunking code, especially the fastcdc part, is not that much code, so I don't object to MIT/BSD licenses.

I hereby grant permission to use my chunking code (https://github.com/dswd/zvault/tree/master/chunking) under the terms of either the MIT license or BSD license.

If you plan a combined crate with lots of chunking algorithms, I am also happy to move my code there and use that crate as a dependency, as long as it provides an efficient interface for my use case.

@dpc (Collaborator, author) commented Aug 4, 2017

Below are the inner loops from objdump -D dumps.

Mine:

   107a5:       49 8b 14 24             mov    (%r12),%rdx
   107a9:       48 01 d2                add    %rdx,%rdx
   107ac:       0f b6 30                movzbl (%rax),%esi
   107af:       48 ff c0                inc    %rax
   107b2:       48 03 14 f7             add    (%rdi,%rsi,8),%rdx
   107b6:       49 89 14 24             mov    %rdx,(%r12)
   107ba:       48 ff c3                inc    %rbx
   107bd:       49 85 d6                test   %rdx,%r14
   107c0:       75 de                   jne    107a0 <_ZN7rollsum7fastcdc7FastCDC15find_chunk_edge17hca396ceab3902345E+0x1c0>

@dswd's:

    f292:       4d 01 e4                add    %r12,%r12
    f295:       0f b6 84 1d 90 d7 ff    movzbl -0x2870(%rbp,%rbx,1),%eax
    f29c:       ff 
    f29d:       48 ff c3                inc    %rbx
    f2a0:       4c 03 a4 c5 68 cf ff    add    -0x3098(%rbp,%rax,8),%r12
    f2a7:       ff 
    f2a8:       4d 85 c4                test   %r8,%r12
    f2ab:       75 d3                   jne    f280 <_ZN4test13ns_iter_inner17h7be6bdb8f088759eE+0x5b0>

In my code, the compiler fails to keep the digest directly in a register, and keeps reading it from and writing it back through a pointer, (%r12). Since that's two needless additional memory accesses versus just the one that's necessary, it is plausible that this causes the difference between the 700 MB/s and 2100 MB/s that I see on my box.

There is also an additional inc instruction, probably from the enumerate() in find_chunk_edge_cond. I think find_chunk_edge_cond is also the reason why the compiler fails to optimize better - and since I use it in the benchmarks for gear and bup too, they are all affected, so the benchmarks show relatively small differences between them (still measurable, but smaller than expected).

find_chunk_edge_cond is a crappy method in the first place, and I am ashamed that I've invented it, IIRC. :D

@dpc (Collaborator, author) commented Aug 4, 2017

The good part is that 2100 MB/s seems doable (I was afraid it was just a misunderstanding of some kind), so it will push the bottleneck that I often see in rdedup quite far away.

@dpc (Collaborator, author) commented Aug 4, 2017

find_chunk_edge_cond is not the problem. It seems like the digest field in the struct/trait is; the compiler just insists on updating it in place...
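A hedged sketch of the kind of workaround being discussed, not the actual commit: copy the field into a local before the hot loop so it can live in a register, and write it back once afterwards. The dummy table is only there so the sketch runs.

struct Gear {
    digest: u64,
}

impl Gear {
    fn find_chunk_edge(&mut self, buf: &[u8], table: &[u64; 256], mask: u64) -> Option<usize> {
        let mut digest = self.digest; // local copy: free to live in a register
        for (i, &b) in buf.iter().enumerate() {
            digest = (digest << 1).wrapping_add(table[b as usize]);
            if digest & mask == 0 {
                self.digest = 0; // reset for the next chunk
                return Some(i + 1);
            }
        }
        self.digest = digest; // write the field back once, outside the hot loop
        None
    }
}

fn main() {
    let table = [1u64; 256]; // dummy table
    let mut g = Gear { digest: 0 };
    println!("{:?}", g.find_chunk_edge(&[0u8; 1024], &table, (1 << 13) - 1));
}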

@dpc mentioned this pull request on Aug 5, 2017
@dpc (Collaborator, author) commented Aug 5, 2017

Since I've failed to find a way to get the compiler to optimize this properly, I've filed rust-lang/rust#43653.

@dpc (Collaborator, author) commented Aug 6, 2017

The fix is relatively simple, at least for gear.

89010c1

Before:

test gear::tests::bench::perf_1mb ... bench:   1,421,243 ns/iter (+/- 4,696) = 737 MB/s

After:

test gear::tests::bench::perf_1mb ... bench:     544,688 ns/iter (+/- 28,115) = 1925 MB/s

Comparing to fastcdc by @dswd

bench:     478,465 ns/iter (+/- 28,868) = 2191 MB/s               

The difference, I'm guessing, is that fastcdc is generally faster because it's skipping some data, which gear is not doing. I'll redo fastcdc once we settle on the API and #12 in general. Then we can measure again and decide which impl to take.

@dpc (Collaborator, author) commented Aug 6, 2017

Reminder to self: there is one cmp + test pair in both implementations that I don't see a reason for. Could be optimizable.

@dpc (Collaborator, author) commented Aug 6, 2017

I have reworked and optimized things a bit more in #14, reaching a speed almost as good as @dswd's code:

test fastcdc::tests::bench::perf_1mb_008k_chunks ... bench:     516,674 ns/iter (+/- 22,610) = 2029 MB/s

However, I've filed dswd/zvault#10 to show the cost of an API that involves temporary copying.

@dpc (Collaborator, author) commented Aug 6, 2017

Closing this one; we can continue in #14.

@dpc closed this on Aug 6, 2017